For the Duncan seed predation/seed removal data, some of the ecological questions are:
- How does the probability of seed removal vary with distance from the forest edge (10 or 25 m)?
- How does it vary among species, possibly as a function of seed mass?
- How does it vary over time?
Since most of the predictor variables in this case are categorical (species; distance from the forest edge), the deterministic models are relatively simple: different probabilities for different levels of each factor.
On the other hand, the distribution of the number of seeds taken is unusual, so most of the initial modeling effort will go into finding an appropriate stochastic model.
library(emdbook)  # SeedPred data, dbetabinom, dzinbinom
library(bbmle)    # mle2, AICtab
library(lattice)  # barchart (used below)
library(ggplot2)
library(gplots)   # plotCI
Getting started: cleaning up the dataset
data(SeedPred)
# Drop NAs and records where 0 seeds available
SeedPred <- na.omit(subset(SeedPred, available > 0))
attach(SeedPred)
# a subset of the data containing only the cases where at least one seed was taken
nz <- subset(SeedPred, taken > 0)
First step: Examine your data graphically
# barcharts of different combinations (excluding data where 0 seeds are
# taken)
barchart(table(nz$taken, nz$available, nz$dist, nz$species), stack = F, auto.key = TRUE,
scales = list(relation = "free"))
# distance is in green, number of seeds available is in red, the number
# taken is on the y, and the color bars represent species
barchart(table(nz$taken, nz$species, nz$dist, nz$available), stack = F, auto.key = TRUE)
barchart(table(nz$species, nz$available, nz$dist, nz$taken), stack = F, auto.key = TRUE,
scales = list(relation = "free"))
barchart(table(nz$available, nz$dist, nz$taken), stack = F)
barchart(table(nz$available, nz$species, nz$taken), stack = F)
# plot of all the data
barchart(table(available, dist, taken), stack = F, auto.key = TRUE, scales = list(relation = "free"))
# plot by date (cumulative time elapsed, binned into intervals)
tcumfac <- cut(nz$tcum, breaks = c(0, 20, 40, 60, 180))
barchart(table(nz$available, tcumfac, nz$taken), stack = F)
What patterns do we see in these plots? We know these data are discrete and bounded above: no more seeds can be taken than are available.
barchart(table(nz$available, nz$taken), stack = F, auto.key = T, xlab = "Frequency",
ylab = "available")
We've graphed our data in many different ways, so now we need to select a distribution model.
The data set doesn't look like a zero-inflated binomial, because the distribution is lowest in the middle and increases toward both the lower and upper extremes.
So Bolker tried the beta-binomial distribution, which allows for variability in the underlying per-trial probabilities and can be bimodal at 0 and N for extreme values of the overdispersion parameter, as well as a zero-inflated beta-binomial distribution.
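As a quick illustration of that bimodality (a sketch, not part of the lab; theta = 0.5 is just an arbitrarily small value):
# Illustration (sketch): for a small overdispersion parameter the beta-binomial
# puts most of its probability at 0 and at N (here N = 5), i.e. it is u-shaped
round(dbetabinom(0:5, prob = 0.5, size = 5, theta = 0.5), 3)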
Note: One should really test the fits of candidate distributions on a small piece of the data set, or allow for different parameters for each combination of factors, since variation among groups can mask the shape of the underlying distribution.
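For example (a sketch, using the mle2/dbetabinom machinery that appears later in the lab), one could fit the candidate distribution to a single species and check whether the shape still looks the same:
# Sketch: fit the beta-binomial to one species only (species code "pol" chosen arbitrarily)
SP.bb.onesp <- mle2(taken ~ dbetabinom(size = available, prob, theta),
    start = list(prob = 0.5, theta = 1), data = subset(SeedPred, species == "pol"))
coef(SP.bb.onesp)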
Using the dzinbinom function (zero-inflated negative binomial) in the emdbook package as a template, he constructed probability density functions for the zero-inflated binomial (dzibinom) and the zero-inflated beta-binomial (dzibb):
# Zero-inflated binomial
dzibinom <- function(x, prob, size, zprob, log = F) {
logv <- log(1 - zprob) + dbinom(x, prob = prob, size = size, log = T)
logv <- ifelse(x == 0, log(zprob + exp(logv)), logv)
if (log)
logv else exp(logv)
}
# Zero-inflated beta-binomial
dzibb <- function(x, size, prob, theta, zprob, log = F) {
logv <- ifelse(x > size, NA, log(1 - zprob) + dbetabinom(x, prob = prob,
size = size, theta = theta, log = T))
logv <- ifelse(x == 0, log(zprob + exp(logv)), logv)
if (log)
logv else exp(logv)
}
Bolker used the formula interface to mle2 rather than writing an explicit negative log-likelihood function. Since the zero-inflation probability must be between 0 and 1, he fitted it on a logit scale, using plogis to transform it on the fly:
(theta is the overdispersion parameter; logitzprob is the zero-inflation probability on the logit scale.)
SP.zibb <- mle2(taken ~ dzibb(size = available, prob, theta, plogis(logitzprob)),
start = list(prob = 0.5, theta = 1, logitzprob = 0), data = SeedPred)
# He suspects the zero-inflation parameter (logitzprob) and the
# overdispersion parameter (theta) might both be affecting the number of
# zeros, so he checked the correlations among the parameters:
cov2cor(vcov(SP.zibb))
## prob theta logitzprob
## prob 1.0000 0.2906 0.9870
## theta 0.2906 1.0000 0.3453
## logitzprob 0.9870 0.3453 1.0000
# Looks like he's right: logitzprob and prob are 99% correlated,
# suggesting that we could drop the zero-inflation parameter from the
# model.
# New model without zero-inflation parameter
SP.bb <- mle2(taken ~ dbetabinom(size = available, prob, theta), start = list(prob = 0.5,
theta = 1), data = SeedPred)
# Let's compare the models with and without the zero-inflation parameter (logitzprob):
logLik(SP.bb) - logLik(SP.zibb)
## 'log Lik.' 0.07957 (df=2)
# The log-likelihood difference is only about 0.08. Not a significant
# difference.
# Just in case, let's try the zero-inflated binomial as well:
SP.zib <- mle2(taken ~ dzibinom(size = available, prob = p, zprob = plogis(logitzprob)),
start = list(p = 0.2, logitzprob = 0), data = SeedPred)
# now compare the 3 models
AICtab(SP.zib, SP.zibb, SP.bb, sort = T, weights = T)
## dAIC df weight
## SP.bb 0.0 2 0.746
## SP.zibb 2.2 3 0.254
## SP.zib 419.5 2 <0.001
Looking at the dAIC and weight, which model should we choose?
Next step: to check the fit, compute the observed table of number taken by number available and compare it with the probabilities predicted by the zero-inflated beta-binomial.
comb <- table(taken, available)
pcomb <- sweep(comb, 2, colSums(comb), "/")
# sweep divides each column by its column sum, giving the proportion of
# observations with each value of 'taken', for each number available
pcomb
## available
## taken 1 2 3 4 5
## 0 0.941509 0.935698 0.916814 0.931949 0.882589
## 1 0.058491 0.017738 0.028319 0.026528 0.032614
## 2 0.000000 0.046563 0.015929 0.012687 0.016056
## 3 0.000000 0.000000 0.038938 0.006920 0.007025
## 4 0.000000 0.000000 0.000000 0.021915 0.011039
## 5 0.000000 0.000000 0.000000 0.000000 0.050677
# now compute the predicted probabilities from the fitted zero-inflated beta-binomial
cvals <- coef(SP.zibb)
mtab <- matrix(0, nrow = 6, ncol = 5)
for (N in 1:5) {
    mtab[1:(N + 1), N] <- dzibb(0:N, size = N, prob = cvals["prob"],
        theta = cvals["theta"], zprob = plogis(cvals["logitzprob"]))
}
mtab
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.93471 0.91977 0.91140 0.90560 0.90118
## [2,] 0.06529 0.02988 0.02511 0.02318 0.02212
## [3,] 0.00000 0.05035 0.01971 0.01545 0.01371
## [4,] 0.00000 0.00000 0.04378 0.01598 0.01204
## [5,] 0.00000 0.00000 0.00000 0.03979 0.01396
## [6,] 0.00000 0.00000 0.00000 0.00000 0.03700
Do these tables look similar? Where do they differ?
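One way to compare them side by side (a sketch using base graphics; the lab goes straight to a chi-squared test):
# Sketch: observed vs. predicted proportions of numbers taken, for each number available
op <- par(mfrow = c(1, 5))
for (N in 1:5) {
    barplot(rbind(observed = pcomb[1:(N + 1), N], predicted = mtab[1:(N + 1), N]),
        beside = TRUE, main = paste("available =", N))
}
par(op)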
Next we can calculate standard Pearson chi-squared p-values comparing the observed numbers taken with the predicted probabilities, for each number of seeds available:
pval <- numeric(5)
for (N in 1:5) {
obs = comb[1:(N + 1), N]
prob = mtab[1:(N + 1), N]
pval[N] = chisq.test(obs, p = prob)$p.value
}
pval
## [1] 5.261e-01 2.897e-01 8.102e-01 1.163e-02 5.827e-05
There are still statistically significant discrepancies between the expected and observed distributions when 4 or 5 seeds are available. We could try to make the stochastic model more complex and more accurate, but we have probably reached the limit of what we can do with simple models, and we may also have reached the limit of what we can do with these data.
Differences Among Transects
Here Bolker wanted to parameterize the model so that mle2 would estimate a separate probability and overdispersion parameter for each distance, rather than estimating the parameters for the first transect and the difference between the first and second transects, so he used the formulas prob ~ dist - 1 and theta ~ dist - 1 to fit the model without an intercept.
SP.bb.dist <- mle2(taken ~ dbetabinom(prob, theta, size = available), parameters = list(prob ~
dist - 1, theta ~ dist - 1), start = as.list(coef(SP.bb)), data = SeedPred)
# A Likelihood Ratio test on the two models suggests a significant
# difference between transects
anova(SP.bb, SP.bb.dist)
## Likelihood Ratio Tests
## Model 1: SP.bb, taken~dbetabinom(size=available,prob,theta)
## Model 2: SP.bb.dist, taken~dbetabinom(prob,theta,size=available):
## prob~dist-1, theta~dist-1
## Tot Df Deviance Chisq Df Pr(>Chisq)
## 1 2 3622
## 2 4 3616 6.48 2 0.039 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Set up starting values, using qlogis (the logit transform) and log to transform the estimated values of the p and theta parameters from above.
startvals <- list(lprob = qlogis(coef(SP.bb.dist)["prob.dist10"]), ltheta = log(coef(SP.bb.dist)["theta.dist10"]))
SP.bb.dist2 <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
parameters = list(lprob ~ dist, ltheta ~ dist), start = startvals, data = SeedPred)
summary(SP.bb.dist2)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = taken ~ dbetabinom(plogis(lprob), exp(ltheta),
## size = available), start = startvals, data = SeedPred, parameters = list(lprob ~
## dist, ltheta ~ dist))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## lprob.(Intercept) -2.79683 0.08140 -34.36 <2e-16 ***
## lprob.dist25 0.26630 0.11103 2.40 0.016 *
## ltheta.(Intercept) -1.12555 0.12614 -8.92 <2e-16 ***
## ltheta.dist25 -0.00358 0.17195 -0.02 0.983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 3616
The summary of the model now gives us approximate p-values on the parameters, showing that the difference between transects is caused by a change in p and not a change in theta.
(The highly significant p-values for lprob.(Intercept) and ltheta.(Intercept), which correspond to the 10 m transect, are not biologically meaningful: they merely show that logit(p) for that transect is not 0 (i.e., p is not 0.5) and that log(theta) is not 0 (i.e., theta is not 1), neither of which is ecologically interesting.)
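To see what those intercept estimates mean on the natural scale, we can back-transform them (coefficient values copied from the summary above):
# Back-transform the 10 m (intercept) estimates to the natural scale
plogis(-2.79683)  # p for the 10 m transect, about 0.06
exp(-1.12555)     # theta for the 10 m transect, about 0.32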
So we reduce the model, allowing only p to vary between transects:
SP.bb.probdist <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
parameters = list(lprob ~ dist), start = startvals, data = SeedPred)
Both the LRT and the AIC approaches suggest that the best model is one in which p varies between transects but theta does not (although the AIC table suggests that the more complex model with differing theta should be kept in consideration):
anova(SP.bb, SP.bb.probdist, SP.bb.dist)
## Likelihood Ratio Tests
## Model 1: SP.bb, taken~dbetabinom(size=available,prob,theta)
## Model 2: SP.bb.probdist,
## taken~dbetabinom(plogis(lprob),exp(ltheta),size=available):
## lprob~dist
## Model 3: SP.bb.dist, taken~dbetabinom(prob,theta,size=available):
## prob~dist-1, theta~dist-1
## Tot Df Deviance Chisq Df Pr(>Chisq)
## 1 2 3622
## 2 3 3616 6.48 1 0.011 *
## 3 4 3616 0.00 1 0.983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AICtab(SP.bb, SP.bb.probdist, SP.bb.dist, sort = T, weights = T)
## dAIC df weight
## SP.bb.probdist 0.0 3 0.6783
## SP.bb.dist 2.0 4 0.2496
## SP.bb 4.5 2 0.0721
c1 <- coef(SP.bb.probdist)
plogis(c(c1[1], c1[1] + c1[2]))
## lprob.(Intercept) lprob.(Intercept)
## 0.05752 0.07372
The difference is small: about 6% vs. 7% probability of removal per observation. This difference is unlikely to be ecologically significant, and it reminds us that when we have a big data set (4406 observations) even small differences can be statistically significant. What would you do with this result?
Now he proceeded to test differences among species. First he tried a model with both theta and p varying. (Both parameters are again fitted on transformed scales, logit and log respectively.)
SP.bb.sp <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
parameters = list(lprob ~ species, ltheta ~ species), start = startvals,
data = SeedPred)
summary(SP.bb.sp)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = taken ~ dbetabinom(plogis(lprob), exp(ltheta),
## size = available), start = startvals, data = SeedPred, parameters = list(lprob ~
## species, ltheta ~ species))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## lprob.(Intercept) -1.9255 0.1428 -13.49 < 2e-16 ***
## lprob.speciescd 0.3292 0.2186 1.51 0.13211
## lprob.speciescor -1.3330 0.2144 -6.22 5.1e-10 ***
## lprob.speciesdio -0.9915 0.2111 -4.70 2.6e-06 ***
## lprob.speciesmmu -0.4324 0.2130 -2.03 0.04237 *
## lprob.speciespol 0.4131 0.2098 1.97 0.04895 *
## lprob.speciespsd -1.2744 0.2207 -5.77 7.7e-09 ***
## lprob.speciesuva -1.3029 0.2146 -6.07 1.3e-09 ***
## ltheta.(Intercept) -0.8243 0.2240 -3.68 0.00023 ***
## ltheta.speciescd -0.5608 0.3473 -1.61 0.10635
## ltheta.speciescor 0.0161 0.3292 0.05 0.96106
## ltheta.speciesdio -0.3780 0.3276 -1.15 0.24858
## ltheta.speciesmmu -0.6186 0.3354 -1.84 0.06515 .
## ltheta.speciespol 0.1529 0.3331 0.46 0.64628
## ltheta.speciespsd -0.1734 0.3405 -0.51 0.61053
## ltheta.speciesuva -0.0590 0.3341 -0.18 0.85992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 3460
Results suggest that, as in the case of differences among transects, differences in p and not in theta are driving the differences among species:
SP.bb.probsp <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
parameters = list(lprob ~ species), start = startvals, data = SeedPred)
anova(SP.bb.sp, SP.bb.probsp, SP.bb)
## Likelihood Ratio Tests
## Model 1: SP.bb.sp,
## taken~dbetabinom(plogis(lprob),exp(ltheta),size=available):
## lprob~species, ltheta~species
## Model 2: SP.bb.probsp,
## taken~dbetabinom(plogis(lprob),exp(ltheta),size=available):
## lprob~species
## Model 3: SP.bb, taken~dbetabinom(size=available,prob,theta)
## Tot Df Deviance Chisq Df Pr(>Chisq)
## 1 16 3460
## 2 9 3470 9.39 7 0.23
## 3 2 3622 152.29 7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AICtab(SP.bb, SP.bb.probsp, SP.bb.sp, sort = T, weights = T)
## dAIC df weight
## SP.bb.probsp 0.0 9 0.9093
## SP.bb.sp 4.6 16 0.0907
## SP.bb 138.3 2 <0.001
Now Bolker wants to know whether seed mass and p are related. If they were, we could fit a likelihood model where p was treated as a function of seed mass, reducing the number of parameters to estimate and perhaps allowing us to predict removal probabilities for other species on the basis of their seed masses.
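A sketch of what such a model might look like (not part of the lab code; it assumes SeedPred_mass, the named vector of per-species seed masses printed below, and adds a hypothetical mass column to SeedPred):
# Hypothetical sketch: treat logit(p) as a linear function of seed mass
SeedPred$mass <- SeedPred_mass[as.character(SeedPred$species)]
SP.bb.probmass <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
    parameters = list(lprob ~ mass), start = startvals, data = SeedPred)
AICtab(SP.bb.probsp, SP.bb.probmass, weights = TRUE, sort = TRUE)
The lab itself instead fits a separate probability for each species (below) and plots those estimates against seed mass: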
SP.bb.probsp0 <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
parameters = list(lprob ~ species - 1), start = startvals, method = "L-BFGS-B",
lower = rep(-10, 9), upper = rep(10, 9), data = SeedPred)
SP.bb.probsp0
##
## Call:
## mle2(minuslogl = taken ~ dbetabinom(plogis(lprob), exp(ltheta),
## size = available), start = startvals, method = "L-BFGS-B",
## data = SeedPred, parameters = list(lprob ~ species - 1),
## lower = rep(-10, 9), upper = rep(10, 9))
##
## Coefficients:
## lprob.speciesabz lprob.speciescd lprob.speciescor lprob.speciesdio
## -1.902 -1.649 -3.201 -2.957
## lprob.speciesmmu lprob.speciespol lprob.speciespsd lprob.speciesuva
## -2.446 -1.484 -3.192 -3.191
## ltheta
## -1.029
##
## Log-likelihood: -1735
predprob <- plogis(coef(SP.bb.probsp0))[1:8]
SP.bb.ci <- plogis(confint(SP.bb.probsp0, method = "quad"))[1:8, ]
# Figure 8.9
SeedPred_mass
## abz cd cor dio mmu pol psd uva
## 0.11 0.03 0.21 0.16 0.35 0.01 0.52 0.27
plotCI(x = SeedPred_mass, y = predprob, li = SP.bb.ci[, 1], ui = SP.bb.ci[, 2])
Rather than the possible trend toward higher seed removal for larger seeds that Bolker expected, the figure shows elevated removal rates for the three smallest-seeded species (explained by Duncan and Duncan as a possible artifact of small seeds being washed out of the trays by rainfall), and a somewhat elevated rate for species mmu. In this case, he would want to go back and see whether there was something special about this species' characteristics or the way it was handled in the experiment.
The initial scan of the data suggested that some species might be more sensitive to the distance from the edge. This possibility is certainly biologically sensible (some species might be taken by specialized seed predators with more restricted movement), and it is the kind of information that could easily be masked by looking at aggregated data. Using the formula interface, we can simply write lprob ~ species * dist to allow for such an interaction; if you need to code such a model by hand, interaction(f1, f2) will create a factor that represents the interaction of factors f1 and f2.
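For instance, a hand-coded version might start like this (sketch):
# Sketch: build the species-by-distance interaction factor by hand
SeedPred$spdist <- interaction(SeedPred$species, SeedPred$dist)
# lprob ~ spdist - 1 would then estimate one removal probability per species:distance combination
The lab uses the formula interface directly: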
SP.bb.probspdist <- mle2(taken ~ dbetabinom(plogis(lprob), exp(ltheta), size = available),
parameters = list(lprob ~ species * dist), start = startvals, method = "L-BFGS-B",
lower = rep(-10, 9), upper = rep(5, 9), data = SeedPred)
anova(SP.bb.probsp, SP.bb.probspdist)
## Likelihood Ratio Tests
## Model 1: SP.bb.probsp,
## taken~dbetabinom(plogis(lprob),exp(ltheta),size=available):
## lprob~species
## Model 2: SP.bb.probspdist,
## taken~dbetabinom(plogis(lprob),exp(ltheta),size=available):
## lprob~species*dist
## Tot Df Deviance Chisq Df Pr(>Chisq)
## 1 9 3470
## 2 17 3455 15.3 8 0.054 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AICtab(SP.bb, SP.bb.probsp, SP.bb.probspdist, SP.bb.sp, SP.bb.probdist, SP.bb.dist,
weights = T, sort = T)
## dAIC df weight
## SP.bb.probsp 0.0 9 0.5587
## SP.bb.probspdist 0.7 17 0.3855
## SP.bb.sp 4.6 16 0.0557
## SP.bb.probdist 133.8 3 <0.001
## SP.bb.dist 135.8 4 <0.001
## SP.bb 138.3 2 <0.001
AIC says that the model without the distance × species interaction is best, but only by a little bit.
Check the relationships between proportion removed and time interval, and between time interval and date.
# figure 8.10
plot(tint ~ date, ylab = "Interval (days)")
mean.prop.taken <- tapply(taken/available, tint, mean, na.rm = T)
sd.prop.taken <- tapply(taken/available, tint, sd, na.rm = T)
n.tint <- table(tint)
se.prop.taken <- sd.prop.taken/sqrt(n.tint)
plotCI(x = mean.prop.taken, y = NULL, uiw = se.prop.taken, liw = se.prop.taken,
err = "y")
Try at home: Do the predictions improve, or the conclusions change, if we account for the time interval allowed for removal?
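One possible starting point (a sketch of our own, not from the lab, under the assumption that removal events are independent from day to day, so the per-interval probability is 1 - (1 - p_daily)^tint):
# Sketch: convert a per-day removal probability into a per-interval probability
SP.bb.tint <- mle2(taken ~ dbetabinom(1 - (1 - plogis(lprob))^tint, exp(ltheta),
    size = available), parameters = list(lprob ~ species), start = startvals,
    data = SeedPred)
AICtab(SP.bb.probsp, SP.bb.tint, weights = TRUE)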